

Abstract

We present a new perspective on the subspace compensation
techniques that currently dominate the field of speaker recognition
using Gaussian Mixture Models (GMMs). Rather than the traditional
factor analysis approach, we use Gaussian modeling in the sufficient
statistic supervector space combined with Probabilistic Principal
Component Analysis (PPCA) within-class and shared across-class
covariance matrices to derive a family of training and testing
algorithms. Key to this analysis is the use of two noise terms for
each speech cut: a random channel offset and a length-dependent
observation noise. Using the Wiener filtering perspective, formulas
for optimal train and test algorithms for Joint Factor Analysis (JFA)
are simple to derive. In addition, we show that an alternative form
of Wiener filtering results in the i-vector approach, thus tying
together these two disparate techniques.

Index Terms: speaker recognition, Gaussian mixture model,
Wiener filter, probabilistic principal components analysis
(PPCA), factor analysis

1. Introduction

Modeling speakers with GMMs and then generating test cut scores by
evaluating the likelihood of each possible speaker has long been a
successful method in speaker recognition [1]. In the last few years,
subspace methods have been shown to provide both convenient models
for channel compensation and rapid speaker enrollment, particularly
with the JFA approach [2]. More recently, the subspace parameters
themselves, referred to as i-vectors, have been used for
recognition [3].

In this paper we present an alternative perspective on these
algorithms based on sufficient statistic scoring, Gaussian
observation and channel noises, PPCA covariance modeling, and Wiener
filtering. The structure of the paper is as follows. First, in
Section 2 we present GMM scoring using sufficient statistics and
introduce the supervector observation and channel noises. A Gaussian
model for the channel noise results in a simple Gaussian likelihood
evaluation in the model supervector space; the use of a structured
covariance matrix with PPCA simplifies the evaluation formulas.
Section 2.3 then introduces the concept of Wiener filtering in the
supervector space, and shows that this leads to straightforward
derivations of the JFA train and test formulas. In Section 3, we show
that reversing the order of the Wiener filter and removing the
observation noise rather than the channel noise results in i-vector
approaches. Section 4 provides experimental results comparing these
various approaches on the NIST SRE10 evaluation. Finally, concluding
remarks are provided in Section 5.

2. GMM Sufficient Statistics: the Path to Supervectors

We begin with a review of GMM model training and testing procedures
presented in terms of Gaussian supervector sufficient statistics and
Wiener filtering. We start with the assumption that all models differ
only in the mean parameters. To be more specific, we assume that each
speaker can be represented by a GMM with speaker-specific means but
shared weights and covariance matrices.

2.1.  Sufficient  Statistic  Scoring 

Traditionally, the likelihood of a set of vectors under a GMM model
is evaluated by directly computing the likelihoods for each frame and
multiplying them together, a process which we refer to as
frame-by-frame scoring. However, it is well known, and the basis for
maximum likelihood training of the parameters of a GMM via the EM
algorithm, that this likelihood can equivalently be evaluated by
first computing the sufficient statistics of the input vectors and
then using a single formula for the total likelihood. We refer to
this as sufficient statistic scoring. For each Gaussian, the
statistics needed are the counts, sum, and sum of squares of the
vectors assigned to that Gaussian; in addition there is an overall
statistic related to the mixture counts. Note that for a GMM (unlike
a single Gaussian), these statistics are model-specific, since the
model parameters are needed to generate the alignment of input
vectors to Gaussians.

In general, sufficient statistic scoring gives the same answer as
frame-by-frame scoring but provides no computational advantage, so it
is not used in the testing process. Its primary use is in the theory
of model training, since it provides formulas for the derivatives
necessary to find optimal GMM parameters. However, it can also be
very helpful for testing in the particular case where the alignment
is already given (for example, from the UBM rather than from each
model) and the likelihood ratio is computed between two GMMs that
differ only in their means, with the same weights and variances. In
this case, only one set of sufficient statistics is needed for all
models, and only the counts and sums (or equivalently the sample
means) are needed. We note that for a straightforward GMM-UBM speaker
recognition system, this assumption of UBM alignment gives a small
performance degradation (on the order of 10%) as compared to using
the correct per-model statistics. However, the use of a single set of
sufficient statistics is critical for the computational feasibility
of the more sophisticated techniques to be described next.
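The claim that counts and sample means suffice under a fixed UBM
alignment can be checked numerically. The sketch below is only an
illustration under stated assumptions: the toy sizes, random data, and
all function names are hypothetical, and "score" here means the
expected log-likelihood under the fixed UBM posteriors. It compares
the score difference between two models obtained frame by frame with
the one obtained from the statistics alone.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, T = 4, 3, 200          # Gaussians, feature dimension, frames (toy sizes)

# Toy UBM with diagonal covariances; all values are made up for illustration.
ubm_means = rng.normal(size=(K, D))
variances = rng.uniform(0.5, 1.5, size=(K, D))
weights = np.full(K, 1.0 / K)
frames = rng.normal(size=(T, D))

def component_log_densities(frames, means):
    """Weighted log density of every frame under every Gaussian, shape (T, K)."""
    diff = frames[:, None, :] - means[None, :, :]
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    return np.log(weights) + log_norm - 0.5 * np.sum(diff**2 / variances, axis=2)

# Alignment (posteriors) computed once, from the UBM only.
log_dens = component_log_densities(frames, ubm_means)
post = np.exp(log_dens - log_dens.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)

# Sufficient statistics: per-Gaussian counts and sample means.
counts = post.sum(axis=0)                            # shape (K,)
sample_means = (post.T @ frames) / counts[:, None]   # shape (K, D)

def frame_score(means):
    """Frame-by-frame expected log-likelihood under the fixed alignment."""
    return np.sum(post * component_log_densities(frames, means))

def stat_score(means):
    """Model-dependent part of the same score, from counts and sample means."""
    diff = sample_means - means
    return -0.5 * np.sum(counts[:, None] * diff**2 / variances)

# Two hypothetical speaker models that differ from the UBM only in the means.
model_a = ubm_means + 0.1 * rng.normal(size=(K, D))
model_b = ubm_means + 0.1 * rng.normal(size=(K, D))

llr_frames = frame_score(model_a) - frame_score(model_b)
llr_stats = stat_score(model_a) - stat_score(model_b)
print(llr_frames, llr_stats)   # the two differences agree up to rounding
```

The two scores differ by a model-independent constant, which cancels
in the difference; this is why one set of statistics can serve every
model.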

The evaluation of GMM likelihood for a set of vectors then reduces to
a single Gaussian evaluation in a vector space of all the Gaussian
means stacked together, which we refer to as a supervector. This
Gaussian likelihood for model i is given by

    p(x | s_i) \sim N(m_i, \Sigma_n)    (1)

where x is the test sample mean supervector, m_i is the model mean
supervector, and \Sigma_n is an observation noise covariance that
shrinks to zero as the number of vectors increases. More
specifically, if the GMM covariances are diagonal then they can also
be stacked into a supervector \Sigma_0, and the observation noise is
a diagonal covariance matrix in the supervector space where each
diagonal element is the corresponding covariance from \Sigma_0
divided by the count for this Gaussian.

The corresponding log-likelihood can be written as

    \log p(x | s_i) = -\frac{1}{2} (x - m_i)^T \Sigma_n^{-1} (x - m_i) + C_0    (2)

where C_0 is a constant which is the same for all models and can be
ignored.

This work was sponsored by the Department of Defense under Air Force
Contract FA8721-05-C-0002. Opinions, interpretations, conclusions,
and recommendations are those of the authors and are not necessarily
endorsed by the United States Government.

2.2.  The  Additive  Noise  Model 

This observation noise is equivalent to an additive Gaussian noise,
since under this model we observe

    x = m_i + n    (3)

where n is Gaussian with zero mean and covariance \Sigma_n. This
leads naturally to the question: what if the observed supervector is
also corrupted by a channel (or session) noise? For example, suppose
that the feature vectors are log filterbank energies, and the test
sequence is from a new channel with unknown frequency response,
resulting in an additive offset to the log filterbanks. In this case
we can write

    x = m_i + c + n    (4)

where c is an unknown offset (additive noise) due to the channel for
this recording. If we assume that c is Gaussian with zero mean and
covariance \Sigma_c, then this is a straightforward mathematical
problem. Since both noise terms are Gaussian and are assumed
independent, the corresponding likelihood for speaker i is also
Gaussian and is given by

    p(x | s_i) \sim N(m_i, \Sigma_c + \Sigma_n)    (5)

We refer to this technique as full Gaussian scoring.

To estimate \Sigma_c, we can use a large training set of model
variations between well-trained models and test cuts and compute a
sample covariance matrix. This is often referred to as the
within-class covariance. Without additional constraints, however,
this covariance matrix will be extremely large (the square of the
supervector dimension). For a 2048-component GMM with 60-dimensional
feature vectors, the supervector size is 2048 x 60 = 122,880, so a
full covariance matrix has more than 10^10 parameters. It would be
much simpler if \Sigma_c were diagonal, but this is not a realistic
assumption. For example, in our hypothetical frequency response case
c would be identical for each individual Gaussian in the supervector,
so many components would be highly correlated. A straightforward
approach to reducing the number of parameters in the channel
covariance matrix is Principal Component Analysis (PCA), where we
keep only the eigenvectors of the covariance corresponding to the q
largest eigenvalues. This is equivalent to assuming that the channel
variation lies in a subspace of the supervectors. An even more
powerful approach is Probabilistic PCA (PPCA) [4], in which the
covariance matrix also includes a constant diagonal term so that it
spans the entire space:

    \Sigma_c = U_c U_c^T + \sigma^2 I.    (6)

If PCA or PPCA is used, the computation of the inverse of the
covariance matrix needed to evaluate the likelihood of a model can be
greatly simplified by the matrix inversion lemma. This formula
requires only the inversion of a q x q matrix rather than a full-size
one:

    (U U^T + D)^{-1} = D^{-1} - D^{-1} U (I + U^T D^{-1} U)^{-1} U^T D^{-1}.    (7)
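As a numerical illustration of Eq. 7, the following sketch compares
the lemma against a direct inverse on a toy problem. The dimensions,
random values, and the helper name woodbury_inverse are assumptions
made for this example only. For the covariance of Eq. 5 with a PPCA
channel model and diagonal observation noise, D would be the diagonal
matrix \sigma^2 I + \Sigma_n.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, q = 50, 5               # toy supervector dimension and subspace rank

U = rng.normal(size=(dim, q))            # subspace loading matrix
d = rng.uniform(0.5, 2.0, size=dim)      # diagonal entries of D

def woodbury_inverse(U, d):
    """Inverse of U U^T + diag(d) using only a q x q linear solve (Eq. 7)."""
    d_inv = 1.0 / d
    DinvU = d_inv[:, None] * U                    # D^{-1} U
    small = np.eye(U.shape[1]) + U.T @ DinvU      # I + U^T D^{-1} U, size q x q
    return np.diag(d_inv) - DinvU @ np.linalg.solve(small, DinvU.T)

direct = np.linalg.inv(U @ U.T + np.diag(d))      # full-size inversion
lemma = woodbury_inverse(U, d)
print(np.max(np.abs(direct - lemma)))             # difference at rounding level
```

In a realistic system the full inverse would never be formed
explicitly; the same pieces would instead be applied to a vector, so
that storage stays proportional to dim x q.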


2.3.  Wiener  Filtering  for  Channel  Compensation 

As an alternative to evaluating the likelihood of the test sequence
under both observation and channel noise, it might be simpler to
compensate for the channel noise with a pre-processing step. This is
the vector space equivalent of a noise suppression algorithm for time
signals. Minimizing the mean square error between the clean and
compensated supervectors results in a matrix Wiener filter [5].

2.3.1.  Speaker-Dependent  Channel  Compensation 

Recall our modeling assumption of Eq. 4. The MMSE estimate of the
channel-compensated supervector, assuming the model mean is known, is
given by the Wiener filter:

    \hat{x} = \Sigma_n (\Sigma_c + \Sigma_n)^{-1} (x - m_i) + m_i    (8)

Equivalently, we can first estimate the channel supervector and then
subtract it from the input using:

    \hat{c}_i = \Sigma_c (\Sigma_c + \Sigma_n)^{-1} (x - m_i)    (9)

    \hat{x} = x - \hat{c}_i.    (10)

In either case, we then evaluate the model likelihood assuming only
observation noise with Eq. 1. If PCA or PPCA is used to model the
channel covariance, the matrix inversion required for Wiener
filtering is the same one needed for the full Gaussian scoring in the
previous section, so the matrix inversion lemma can again be used to
avoid a large matrix inversion.
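The equivalence between the direct Wiener filter (Eq. 8) and the
estimate-then-subtract form (Eqs. 9 and 10) can be verified with a
small sketch. Everything here is a toy assumption: the dimensions,
the randomly generated covariances, and the variable names are chosen
for illustration and are not taken from the experimental system.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, q = 12, 3

U_c = rng.normal(size=(dim, q))
sigma_c = U_c @ U_c.T + 0.1 * np.eye(dim)            # PPCA channel covariance (Eq. 6)
sigma_n = np.diag(rng.uniform(0.2, 1.0, size=dim))   # diagonal observation noise

m_i = rng.normal(size=dim)                           # known model mean supervector
c = rng.multivariate_normal(np.zeros(dim), sigma_c)  # channel offset
n = rng.multivariate_normal(np.zeros(dim), sigma_n)  # observation noise
x = m_i + c + n                                      # observed supervector (Eq. 4)

total_inv = np.linalg.inv(sigma_c + sigma_n)

# Eq. 8: direct estimate of the channel-compensated supervector.
x_hat_direct = sigma_n @ total_inv @ (x - m_i) + m_i

# Eqs. 9-10: estimate the channel offset, then subtract it.
c_hat = sigma_c @ total_inv @ (x - m_i)
x_hat_two_step = x - c_hat

print(np.max(np.abs(x_hat_direct - x_hat_two_step)))  # rounding-level difference
```

The agreement follows from \Sigma_n (\Sigma_c + \Sigma_n)^{-1} =
I - \Sigma_c (\Sigma_c + \Sigma_n)^{-1}.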

2.3.2.  Speaker-Independent  Channel  Compensation 

Channel-compensated GMM scoring requires computing a new channel
offset for each speaker. If many models are to be scored, it is
tempting to reduce complexity by using the same offset for all
models. The assumption of a model-independent channel offset also
allows for the possibility of feature domain pre-processing of the
input signal [6]. A simple approximation is commonly used for this,
namely to assume the model is actually the UBM, with mean supervector
m_0, so that:

    \hat{c} = \Sigma_c (\Sigma_c + \Sigma_n)^{-1} (x - m_0).    (11)

Unfortunately, this assumption of a single channel offset for all
models does cause some performance degradation. Fortunately, another
simplification referred to as linear (or inner product) scoring
experimentally seems able to compensate for this loss [7]. This
consists of approximating the Gaussian evaluation with only the
linear term:

    (\hat{x} - m_i)^T \Sigma_n^{-1} (\hat{x} - m_i) = -2 \hat{x}^T \Sigma_n^{-1} m_i + C_i    (12)

We would argue that a better approach to model-independent
compensation would be to use the MMSE estimate of the channel offset
when the model is unknown, which is given by the following Wiener
filter:

    \hat{c} = \Sigma_c (\Sigma_s + \Sigma_c + \Sigma_n)^{-1} (x - m_0)    (13)

where we need to know the mean and covariance of the model means, m_0
and \Sigma_s, which will be discussed in the next section.
Unfortunately, we have found that using this equation in speaker
recognition does not work well; the reasons for this are not yet
clear.


2.4.  GMM  Model  Training  by  Wiener  Filtering 

We can also use this formalism to derive the optimal estimate of the
model mean for a new speaker enrollment. We begin without channel
distortion, in which case a speaker's training data can be modeled by
a sample mean supervector corrupted by additive observation noise as
given by Eq. 3. The MMSE estimate of the model mean can be obtained
by Wiener filtering to remove the observation noise n:

    \hat{m}_i = \Sigma_s (\Sigma_s + \Sigma_n)^{-1} (x - m_0) + m_0.    (14)

We assume that the mean of all model means m_0 ("typical speaker") is
given by the UBM. In more general terminology, the speaker covariance
\Sigma_s is referred to as the across-class covariance matrix.
Similarly to the within-class case, we estimate this covariance using
a sample covariance of model means over a large training set, and we
need to assume a structured covariance matrix to reduce the number of
parameters. If we assume \Sigma_s is diagonal with the form of a
constant times the GMM covariance \Sigma_0, this corresponds exactly
to relevance MAP adaptation of a model from the UBM [1]. If we use a
PCA structure, this becomes eigenvoice modeling. The advantage of the
eigenvoice approach is faster training with small amounts of data,
since all Gaussians are updated even if only some are seen in
training. Unfortunately, when a large amount of data is available the
eigenvoice approach does not converge to the correct model unless the
subspace assumption is exactly correct. With PPCA, we get a more
general representation of the extended MAP (EMAP) approach, which
combines fast adaptation speed with complete convergence [8]. Note
that if we normalize all supervectors to the GMM covariance (by
multiplying by \Sigma_0^{-1/2}) then the constant PPCA term
corresponds to relevance MAP; we use this normalization in all of our
experiments.
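The correspondence between Eq. 14 and relevance MAP can be made
concrete with a short sketch. It assumes a diagonal speaker
covariance equal to the GMM covariance divided by a relevance factor
(the value 16 and all variable names are illustrative assumptions,
not settings from our experiments), and observation noise equal to
the GMM covariance divided by the per-Gaussian count.

```python
import numpy as np

rng = np.random.default_rng(3)
K, D = 4, 3
relevance = 16.0                                   # hypothetical relevance factor

ubm_means = rng.normal(size=(K, D))
variances = rng.uniform(0.5, 1.5, size=(K, D))     # diagonal GMM covariances (Sigma_0)
counts = np.array([0.5, 5.0, 50.0, 500.0])         # per-Gaussian frame counts
sample_means = ubm_means + rng.normal(size=(K, D))

# Stack everything into supervectors of length K * D.
m0 = ubm_means.ravel()
x = sample_means.ravel()
sigma_0 = variances.ravel()
n = np.repeat(counts, D)                           # count repeated per dimension

sigma_s = np.diag(sigma_0 / relevance)             # speaker (across-class) covariance
sigma_n = np.diag(sigma_0 / n)                     # observation noise covariance

# Eq. 14: Wiener filter estimate of the model mean.
m_wiener = sigma_s @ np.linalg.solve(sigma_s + sigma_n, x - m0) + m0

# Classical relevance MAP interpolation between sample mean and UBM mean.
alpha = n / (n + relevance)
m_map = alpha * x + (1.0 - alpha) * m0

print(np.max(np.abs(m_wiener - m_map)))            # rounding-level difference
```

With these diagonal choices the Wiener gain for each element reduces
to n / (n + r), so Gaussians with few frames stay close to the UBM
while well-observed Gaussians move to their sample means.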

In the presence of channel noise, a single enrollment cut is again
represented by Eq. 4. Now the MMSE estimate requires removing both
channel and observation noise by Wiener filtering:

    \hat{m}_i = \Sigma_s (\Sigma_s + \Sigma_c + \Sigma_n)^{-1} (x - m_0) + m_0    (15)

which is equivalent to the two-stage process of channel bias
estimation with Eq. 13 followed by mean estimation:

    \hat{m}_i = \Sigma_s (\Sigma_s + \Sigma_n)^{-1} (x - \hat{c} - m_0) + m_0.    (16)

This can be interpreted as channel compensation followed by EMAP
training. Note that the equation for estimating the channel offset is
different for training than it was for test, since here the actual
model is not yet known, resulting in an additional Gaussian
uncertainty.
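The equivalence of the one-step estimate (Eq. 15) and the two-stage
process (Eq. 13 followed by Eq. 16) can be checked numerically. The
sketch below uses small random covariances purely as toy data; the
helper random_covariance and all sizes are assumptions made for
illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
dim = 10

def random_covariance(rank):
    """Low-rank-plus-identity covariance, used here only as toy data."""
    U = rng.normal(size=(dim, rank))
    return U @ U.T + 0.1 * np.eye(dim)

sigma_s = random_covariance(4)                       # speaker covariance
sigma_c = random_covariance(2)                       # channel covariance
sigma_n = np.diag(rng.uniform(0.2, 1.0, size=dim))   # observation noise

m0 = rng.normal(size=dim)                            # UBM mean supervector
x = rng.normal(size=dim)                             # enrollment sample mean

total = sigma_s + sigma_c + sigma_n

# One step (Eq. 15): remove channel and observation noise together.
m_one_step = sigma_s @ np.linalg.solve(total, x - m0) + m0

# Two steps: channel offset with the model unknown (Eq. 13),
# then mean estimation on the compensated supervector (Eq. 16).
c_hat = sigma_c @ np.linalg.solve(total, x - m0)
m_two_step = sigma_s @ np.linalg.solve(sigma_s + sigma_n, x - c_hat - m0) + m0

print(np.max(np.abs(m_one_step - m_two_step)))       # rounding-level difference
```

The identity behind this is x - \hat{c} - m_0 = (\Sigma_s + \Sigma_n)
(\Sigma_s + \Sigma_c + \Sigma_n)^{-1} (x - m_0), so the inner inverse
in Eq. 16 cancels.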

These are the equations for a single enrollment cut. The precise
equations for multiple enrollments are complicated, but a common
approximation is to perform channel compensation on each cut and then
sum statistics for the final EMAP training. More precisely, though,
the amount of channel compensation needed should be reduced as the
number of cuts increases, since the channel will be averaged out
automatically even without explicit compensation.

The combination of a PCA within-class (channel) covariance with a PCA
or PPCA across-class covariance provides exactly the Joint Factor
Analysis equations [2].

3.  Reversing  the  Order:  I-vectors 

So far, we have used Wiener filtering in two ways: at test time to
remove channel noise, and during training to remove both channel and
observation noise. Here we explore an alternative possibility of
reversing the order during testing: remove the observation noise
first and then evaluate the likelihood of the channel noise.

Again we start with our modeling assumption of Eq. 4, but now we
obtain the MMSE estimate of the observation-noise-compensated
supervector using a Wiener filter:

    \hat{x} = \Sigma_c (\Sigma_c + \Sigma_n)^{-1} (x - m_i) + m_i    (17)

We then evaluate the model likelihood using only channel noise:

    p(\hat{x} | s_i) \sim N(m_i, \Sigma_c)    (18)

Note that this is equivalent to estimating the channel with Eq. 9 and
then evaluating

    p(\hat{c}_i | s_i) \sim N(0, \Sigma_c)    (19)

This equation is straightforward with a PPCA model for \Sigma_c, but
we can expand the key term of the log likelihood to gain additional
insight:

    \hat{c}_i^T \Sigma_c^{-1} \hat{c}_i = (x - m_i)^T (\Sigma_c + \Sigma_n)^{-1} \Sigma_c \Sigma_c^{-1} \Sigma_c (\Sigma_c + \Sigma_n)^{-1} (x - m_i)

For a PCA channel covariance (equivalent to a PPCA covariance as
\sigma^2 \to 0), \Sigma_c = U_c U_c^T, and

    \hat{c}_i^T \Sigma_c^{-1} \hat{c}_i = z_i^T z_i    (20)

where z_i is the low-dimensional component of \hat{c}_i before
mapping back to the full supervector:

    z_i = U_c^T (\Sigma_c + \Sigma_n)^{-1} (x - m_i)    (21)

This shows that we can equivalently evaluate the likelihood of a
speaker model with:

    p(z_i | s_i) \sim N(0, I_{q_c})    (22)

where I_{q_c} is the q_c x q_c identity matrix.

Therefore, Wiener filtering the observation noise rather than the
channel noise results in the evaluation of the log-likelihood of the
particular model as a simple inner product in the low-dimensional
channel space. This q_c-dimensional vector is referred to as an
i-vector [3], and in this case it could also be referred to as a
speaker-dependent channel factor.
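The reduction of the channel log-likelihood term to an inner product
in the low-dimensional space (Eqs. 20 and 21) can be illustrated
numerically. The sketch assumes a pure PCA channel covariance, so the
middle product \Sigma_c \Sigma_c^{-1} \Sigma_c of the expanded term
is taken as \Sigma_c; the sizes, random values, and variable names
are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
dim, q_c = 12, 3

U_c = rng.normal(size=(dim, q_c))
sigma_c = U_c @ U_c.T                                # PCA channel covariance, rank q_c
sigma_n = np.diag(rng.uniform(0.2, 1.0, size=dim))   # diagonal observation noise

m_i = rng.normal(size=dim)                           # model mean supervector
x = rng.normal(size=dim)                             # test sample mean supervector

# Shared term (Sigma_c + Sigma_n)^{-1} (x - m_i).
residual = np.linalg.solve(sigma_c + sigma_n, x - m_i)

z_i = U_c.T @ residual                               # Eq. 21: channel factor
c_hat = sigma_c @ residual                           # Eq. 9: channel offset estimate

# Expanded key term with the middle product collapsed to Sigma_c.
quad_full = residual @ sigma_c @ residual
quad_low = z_i @ z_i                                 # Eq. 20: inner product

print(quad_full, quad_low)                           # equal up to rounding
```

The same run also shows that the full channel estimate is recovered
by mapping the factor back with U_c, that is, \hat{c}_i = U_c z_i.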

3.1.  Model  Independent  I-vectors 

In a fashion similar to the channel compensation approach, we can
replace the model-dependent observation noise compensation Wiener
filter with a model-independent one based on the UBM:

    \hat{x} = (\Sigma_s + \Sigma_c)(\Sigma_s + \Sigma_c + \Sigma_n)^{-1} (x - m_0) + m_0    (23)

We can simplify this notation by introducing the total covariance as
the sum of the channel (within-class) and model (across-class)
covariances: \Sigma_{tot} = \Sigma_s + \Sigma_c.

If we use PCA modeling for both the channel and model covariances,
and assume that both lie in the same subspace, then a similar
limiting approach as above leads to the following expression for
evaluating the likelihood of the test observation given a model:

    p(z | s_i) \sim N(m_i^*, \Sigma_c^*)    (24)

where z is the low-dimensional component of the mean-removed \hat{x}
before mapping back to the full supervector, with \Sigma_{tot} =
U_{tot} U_{tot}^T:

    z = U_{tot}^T (\Sigma_{tot} + \Sigma_n)^{-1} (x - m_0)    (25)

and m_i^* and \Sigma_c^* represent the model mean and channel
covariance in the subspace.

This result shows that Wiener filtering the observation noise in a
model-independent fashion implies that the log-likelihood of a
particular model is a simple Gaussian evaluation of the channel
covariance in the low-dimensional total variability space. This
q-dimensional vector is another example of an i-vector, in this case
a total factor.
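A brief sketch of the model-independent case follows. It assumes a
PCA total covariance \Sigma_{tot} = U_{tot} U_{tot}^T, where U_tot is
a hypothetical name for the matrix spanning the shared subspace, and
checks that the total factor of Eq. 25 is the low-dimensional
coordinate of the mean-removed filtered supervector of Eq. 23. All
sizes and values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
dim, q = 12, 4

U_tot = rng.normal(size=(dim, q))                    # assumed shared subspace basis
sigma_tot = U_tot @ U_tot.T                          # PCA total covariance
sigma_n = np.diag(rng.uniform(0.2, 1.0, size=dim))   # diagonal observation noise

m0 = rng.normal(size=dim)                            # UBM mean supervector
x = rng.normal(size=dim)                             # test sample mean supervector

# Shared term (Sigma_tot + Sigma_n)^{-1} (x - m0).
residual = np.linalg.solve(sigma_tot + sigma_n, x - m0)

z = U_tot.T @ residual                               # Eq. 25: total factor
x_hat = sigma_tot @ residual + m0                    # Eq. 23 with Sigma_tot

print(np.max(np.abs((x_hat - m0) - U_tot @ z)))      # rounding-level difference
```

Because the filter no longer depends on m_i, z is computed once per
test cut and then scored against every model in the q-dimensional
space.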


System           Male 1c   Male 8c   Female 1c   Female 8c
GMM-UBM          (row illegible in source; only the value 14.60 survives)
GMM-UBM zt         11.40      5.24       13.84        7.33
GMM stat           18.87      7.44       21.99        8.19
GMM stat zt        11.18      5.24       14.34        7.33
JFA full            3.68      0.95        4.56        1.30
JFA SI WF           7.1       2.0         8.2         2.3
JFA SI linear       3.64      0.48        4.56        1.14
FA SI linear        4.13      0.48        5.55        0.91
ivec SD             7.22      4.35       10.98        8.19
ivec SI             3.32      0.60        5.06        1.72
ivec cosine         3.31      0.62        4.94        1.72

Table 1: EER (%) Performance for NIST SRE10 Extended Evaluation
Telephone Data


4.  Experimental  Results 

We have compared the performance of some of the systems described in
this paper on the NIST SRE 2010 extended evaluation task [9]. We used
a modified version of the MITLL JFA submission [10], using a
512-mixture GMM based on 39-dimensional telephone-bandwidth cepstral
features including deltas, with feature mean and variance
normalization to mitigate channel effects. The background model and
speaker covariance were trained on Switchboard II as well as SRE
2004, 2005, and 2006 telephone data. Channel covariance training used
the same data except for Switchboard. The PPCA dimension for the
speaker space was 300, and the PCA dimension for the channel noise
was 100. For the i-vector approach, the total PCA dimension was 400.

The  following  systems  were  tested: 

•  GMM-UBM: straightforward GMM-UBM system with frame-by-frame
scoring

•  GMM-UBM zt: as above with ZT-norm

•  GMM stat: GMM-UBM using UBM-aligned statistics for scoring

•  GMM stat zt: as above with ZT-norm

•  JFA full: PPCA speaker model, PCA channel, full Gaussian scoring,
ZT-norm

•  JFA SI WF: JFA with speaker-independent Wiener filtering

•  JFA SI linear: JFA with speaker-independent channel compensation
and linear scoring

•  FA SI linear: as above but diagonal speaker model

•  ivec SD: speaker-dependent i-vector system

•  ivec SI: speaker-independent i-vector system

•  ivec cosine: reference approach using cosine scoring [3].

From the EER and minimum DCF results in Tables 1 and 2, we draw the
following conclusions:

•  Sufficient statistic scoring results in a slight performance
degradation which is eliminated by ZT-norm.

•  All forms of channel compensation provide drastic improvement,
with the exception of JFA SI WF and ivec SD.

•  Fast speaker adaptation with EMAP provides modest gain (JFA vs.
FA).

•  Scoring observation noise only, channel noise only, or both all
result in similar performance.


Table 2: Normalized minDCF Performance for NIST SRE10 Extended
Evaluation Telephone Data


5.  Conclusion 

In this paper we have presented an alternative perspective on
subspace modeling in GMM-based speaker recognition. Working with
sufficient statistics, Gaussian PPCA speaker and channel models, and
Wiener filtering provided a straightforward approach to deriving JFA
algorithms. An alternative approach to Wiener filtering yielded the
i-vector approach, showing that both JFA and i-vectors can be derived
from a single formalism.